Although some models have used more than a single risk factor, most research relies on traditional statistical approaches that restrict the number of variables that can be examined simultaneously, creating overly simplistic models (Franklin et al. 2017).
Theoretically, the processes that facilitate suicide morbidity are complex and entail multiple interactions; therefore, any risk factor considered in isolation will be an inaccurate predictor.
A shift in research is needed to capture the complexities behind adolescent suicide morbidity
Using methods with better predictive performance
Risk algorithms instead of single risk factors
Machine learning in Suicidology
35 independent studies used ML to predict suicide-related events
More accurate predictions than traditional statistical methods (AUCs = 0.80–0.84)
Few studies have used adolescent populations
Research aims
Identify the critical risk factors for adolescent suicide morbidity from a set of 99 risk behavior predictors with machine learning classification algorithms.
Identify the best machine learning methodology to classify adolescents who attempted and considered suicide according to classification performance (area under the receiver operating characteristic curve, overall accuracy, and the Kappa value).
Compare the performance of an a priori-determined model to models informed by feature selection from the least absolute shrinkage and selection operator (lasso) method.
Identify if there are differences in the critical risk factors for suicide ideation and suicide attempts.
Conceives human development as the constant interaction between individuals and the changing environment in which they live and grow (Bronfenbrenner 1977).
Ontogenic
Sex
Race
Age
Microsystem
Family members
Friends
School
Exosystem
The media
Neighborhood
Macrosystem
Economic, social, educational, legal, and political systems
Allows studying adolescent suicide morbidity as the interaction of multiple risk factors at multiple levels of the adolescent's system (Perkins and Hartless 2002)
Moves beyond the tendency to evaluate only individualistic characteristics of adolescents
A survey that has monitored health behaviors and experiences among high school students in grades 9–12 attending U.S. public and private schools since 1991 (Underwood et al. 2020)
The total weighted sample for the Combined YRBS High School Dataset is 14,395,146 cases
From these, 7,159,104 are female, and 7,141,727 are male
The proportion of students who reported attempting suicide in this data is 8%
The proportion of students who considered suicide is 15%
Outcomes:
(Q26) During the past 12 months, did you ever seriously consider attempting suicide?
(Q28) During the past 12 months, how many times did you actually attempt suicide?
Predictors:
Demographic variables (age, sex, grade, race, sexual identity, site, year)
Questionnaire items (q8-q99)
The main categories included in the survey
Behaviors that contribute to unintentional injury and violence
Tobacco use
Alcohol and other drug use
Sexual behaviors that contribute to unintended pregnancy and STD/HIV infection
Dietary behaviors
Physical inactivity
Logistic Regression, Lasso, K-Nearest Neighbors, Random Forest, Classification and Regression Trees, and Extreme Gradient Boosting will be used to generate the predictive models
The complete dataset will be divided into two datasets: 75% for training and 25% for testing (Kuhn and Silge 2022).
The training dataset will be used for 10-fold cross-validation to tune the relevant hyperparameters for each technique (Kuhn and Silge 2022)
The best model will be selected according to the highest area under the receiver operating characteristic curve, overall accuracy, and Kappa value (Kuhn and Silge 2022)
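The split-and-tune workflow above can be sketched as follows. The cited Kuhn and Silge (2022) text uses R's tidymodels; this is a Python/scikit-learn analogue on synthetic data (the dataset, model, and hyperparameter grid here are illustrative, not the actual YRBS setup):

```python
# Sketch of the proposed workflow: 75/25 split, then 10-fold CV on the
# TRAINING data only, keeping the test set for the final evaluation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the survey data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 75% training / 25% testing split, stratified on the outcome
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 10-fold cross-validation on the training data to tune hyperparameters
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, 5, None]},   # illustrative grid
    cv=10, scoring="roc_auc")
grid.fit(X_train, y_train)

# the held-out test set is touched only once, for the final AUC estimate
test_score = grid.score(X_test, y_test)
print(grid.best_params_, round(test_score, 2))
```

The key point the sketch encodes is that tuning happens inside the training partition; the 25% test set plays no role until the final comparison of candidate models.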
Accuracy: the fraction of predictions the model got right
ROC: a receiver operating characteristic curve is a graphical plot that illustrates the diagnostic ability of a binary classifier, created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings
Kappa: how closely the instances classified by the machine learning classifier match the data labeled as the truth, correcting for agreement expected by chance
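The three selection metrics can be illustrated on toy labels (the predictions and probabilities below are made up for the example, not model output):

```python
# Accuracy, AUC, and Kappa on hypothetical labels and predictions.
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels
y_pred  = [1, 0, 1, 0, 0, 0, 1, 1]                   # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6]   # predicted probabilities

acc   = accuracy_score(y_true, y_pred)     # fraction correct: 6/8 = 0.75
auc   = roc_auc_score(y_true, y_score)     # area under the ROC curve
kappa = cohen_kappa_score(y_true, y_pred)  # chance-corrected agreement
print(acc, auc, kappa)
```

Note that AUC is computed from the continuous scores (it sweeps over thresholds), while accuracy and Kappa are computed from the hard class labels; Kappa discounts the agreement that two random labelers would reach by chance.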
Models the outcome as a linear function of the predictors (Burkov 2019).
The sigmoid function is applied so that predictions stay between 0 and 1 (Burkov 2019)
The predictors will be selected from past literature modeling YRBSS data (Bae et al. 2005)
Logistic regression gif. Source: Laken, Paul van der. 2020. “Animated Machine Learning Classifiers.” Paulvanderlaken.com. https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/.
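The core idea, a linear function of the predictors passed through the sigmoid so the output lands in (0, 1), can be written in a few lines. The coefficients and predictor values here are made-up illustrations, not fitted values:

```python
# Logistic regression's prediction step: linear combination + sigmoid.
import numpy as np

def sigmoid(z):
    """Map any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([0.5, -1.2, 0.3])   # hypothetical coefficients
x    = np.array([1.0, 0.4, 2.0])    # one observation; x[0] = 1 is the intercept

p = sigmoid(beta @ x)               # predicted probability of the outcome
print(round(p, 3))
```

Fitting chooses the coefficients that make these probabilities match the observed 0/1 outcomes as closely as possible; the sigmoid is what turns an unbounded linear score into a valid probability.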
Select the subset of variables that minimizes prediction error.
Adds a penalty to the residual sum of squares.
The beta coefficients shrink toward zero
This technique will select only relevant coefficients (James et al. 2013).
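The shrinkage behavior can be shown on synthetic data: with an L1 penalty, the coefficients of irrelevant predictors are driven exactly to zero, which is what makes lasso usable for feature selection. (The study would use the logistic/classification variant; plain `Lasso` regression below illustrates the same penalty.)

```python
# Lasso shrinkage: only the truly relevant predictors keep nonzero weights.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only the first two of the ten predictors actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.5).fit(X, y)      # alpha controls the penalty strength
n_selected = int(np.sum(model.coef_ != 0))
print(model.coef_.round(2), n_selected)
```

Raising `alpha` shrinks all coefficients harder and zeroes out more of them; the surviving nonzero coefficients identify the selected features.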
Predicts the class of a test observation by calculating the distance between it and all the training points, then assigning the majority class among its nearest neighbors.
K-nearest neighbors gif. Source: Laken, Paul van der. 2020. “Animated Machine Learning Classifiers.” Paulvanderlaken.com. https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/.
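A minimal sketch on made-up 2-D data: each test point is assigned the majority class among its k = 3 closest training points.

```python
# K-nearest neighbors: majority vote among the k closest training points.
from sklearn.neighbors import KNeighborsClassifier

# two well-separated synthetic clusters
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# each query point takes the class of its 3 nearest neighbors
pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(pred)  # → [0 1]
```

There is no training step beyond storing the data; all the work (distance computation and voting) happens at prediction time, which is why KNN scales poorly to very large training sets.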
Iterative process that splits the data into partitions or branches, and then continues splitting each partition into smaller groups (Greenwell 2022).
Decision tree gif. Source: Laken, Paul van der. 2020. “Animated Machine Learning Classifiers.” Paulvanderlaken.com. https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/.
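The recursive partitioning can be made visible by printing a shallow tree fit on synthetic data; each printed split divides one partition into two smaller groups:

```python
# CART: recursive binary splits, shown as a printed tree structure.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=1)

# depth 2 keeps the partitioning small enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)
print(export_text(tree))  # text view of the learned splits
```

Capping `max_depth` stops the splitting early; without such a limit a single tree keeps partitioning until the leaves are pure, which tends to overfit, motivating the ensemble methods below.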
Random forest consists of hundreds or thousands of independently grown decision trees generated from different bootstrap samples from the training data (Greenwell 2022).
Uses hundreds of trees in the back end and thus produces a more flexible decision boundary
Random forest gif. Source: Laken, Paul van der. 2020. “Animated Machine Learning Classifiers.” Paulvanderlaken.com. https://paulvanderlaken.com/2020/01/20/animated-machine-learning-classifiers/.
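The bootstrap-and-aggregate idea maps directly onto scikit-learn's implementation; the sample size and tree count below are illustrative:

```python
# Random forest: many trees, each grown on its own bootstrap sample of the
# training data, with predictions aggregated across trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=2)

forest = RandomForestClassifier(
    n_estimators=200,   # hundreds of independently grown trees
    bootstrap=True,     # each tree sees a different bootstrap sample
    random_state=2).fit(X, y)

print(len(forest.estimators_))  # → 200
```

Because each tree is trained independently on a resampled dataset (and on a random subset of features at each split), averaging their votes smooths out the overfitting of any single deep tree.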
Same concept as random forest, but…
Each additional tree added to the model partially corrects the errors made by the previous trees until the maximum number of trees is reached (Burkov 2019)
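The sequential error-correcting behavior can be checked empirically: training loss keeps falling as trees are added, unlike a forest's independent trees. Scikit-learn's `GradientBoostingClassifier` stands in here for XGBoost, on synthetic data:

```python
# Boosting: each new tree is fit to the errors left by the previous trees,
# so training loss decreases as more trees are added.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=300, n_features=10, random_state=3)

few  = GradientBoostingClassifier(n_estimators=5,   random_state=3).fit(X, y)
many = GradientBoostingClassifier(n_estimators=200, random_state=3).fit(X, y)

# the larger ensemble has corrected more of the earlier trees' mistakes
loss_few  = log_loss(y, few.predict_proba(X))
loss_many = log_loss(y, many.predict_proba(X))
print(round(loss_few, 3), round(loss_many, 3))
```

In practice the number of trees is itself a tuned hyperparameter, since a boosted ensemble that keeps driving training loss down will eventually overfit, which is exactly what the 10-fold cross-validation step is meant to guard against.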